We start by reading the data from the wineQualityReds.csv file and storing it in the variable wineQualityData.
wineQualityData <- read.csv('wineQualityReds.csv', header = TRUE)
summary(wineQualityData)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Data: We have 1599 rows of data, where X is the unique identifier for each wine. There are 11 measured attributes that may influence the quality of a wine. Quality is an ordered variable whose values range from 3 to 8 for our given set of wines. The mean wine quality is 5.6.
library(ggplot2)
library(gridExtra)
plot_univariate <- function(property, lower_limit, higher_limit, bin_width) {
grid.arrange(ggplot(wineQualityData, aes(x = 1, y = property)) +
geom_boxplot(color = 'black', fill = 'steelblue') +
scale_y_continuous(lim = c(lower_limit, higher_limit)),
ggplot(data = wineQualityData, aes(x = property)) +
geom_histogram(binwidth = bin_width, color = 'black', fill = 'tan1') +
scale_x_continuous(lim = c(lower_limit, higher_limit)),
ncol = 2)
}
plot_bivariate_wrt_quality <- function(property) {
x = seq(property)
grid.arrange(ggplot(wineQualityData, aes(x = x, y = property, color = factor(wineQualityData$quality), shape = factor(wineQualityData$quality))) +
geom_point(size = 2, alpha = 0.4) +
scale_color_identity(guide = 'legend') +
ylab('Property') +
xlab('#'),
ggplot(wineQualityData, aes(x = factor(wineQualityData$quality), y = property)) +
geom_boxplot(),
nrow = 2)
}
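The helpers above take a whole column vector (for example wineQualityData$pH). As a possible alternative, the sketch below takes the data frame and a column name instead, using the `.data` pronoun that ggplot2 provides for referring to columns by name. The function name plot_histogram_by_name and the toy data frame are hypothetical and used only for illustration; the values are made up and not taken from the dataset.

```r
library(ggplot2)

# Hypothetical helper: histogram of one column, selected by its name.
plot_histogram_by_name <- function(data, column, lower_limit, higher_limit, bin_width) {
  ggplot(data, aes(x = .data[[column]])) +
    geom_histogram(binwidth = bin_width, color = 'black', fill = 'tan1') +
    scale_x_continuous(limits = c(lower_limit, higher_limit)) +
    xlab(column)
}

# Toy data with made-up values, only to show the call.
toy_wines <- data.frame(alcohol = c(9.4, 9.8, 10.0, 10.5, 11.2, 12.8))
alcohol_plot <- plot_histogram_by_name(toy_wines, 'alcohol', 9, 13, 0.5)
```

On the real data the call would look like plot_histogram_by_name(wineQualityData, 'alcohol', 9, 13, 0.5), which also gives the axis a readable label instead of the generic "property".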
We start with a univariate exploration: we look at the distribution of each attribute on its own before relating the attributes to the quality of a wine.
ggplot(wineQualityData, aes(x = factor(wineQualityData$quality))) +
geom_bar(stat = "count", width = 0.5, fill = "steelblue", color = 'black') +
xlab('Quality') +
ylab('Count') +
theme_minimal()
The majority of the wines have a quality of 5 or 6, with very few wines being really good or bad (8 or 3 respectively).
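The share of each quality level can also be read off numerically rather than from the bar chart. The following is a minimal sketch on a toy vector with made-up ratings (not the real dataset); on the real data one would pass wineQualityData$quality instead of toy_quality.

```r
# Toy quality ratings with made-up values (not the real dataset).
toy_quality <- c(5, 6, 5, 6, 6, 3, 8, 5, 7, 6)

# Treat quality as an ordered factor so that the levels keep their order.
quality_factor <- factor(toy_quality, ordered = TRUE)

# Absolute counts and relative shares per quality level.
quality_counts <- table(quality_factor)
quality_share <- prop.table(quality_counts)

print(quality_counts)
print(round(quality_share, 2))
```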
grid.arrange(qplot(wineQualityData$fixed.acidity),
qplot(wineQualityData$volatile.acidity),
qplot(wineQualityData$citric.acid),
qplot(wineQualityData$residual.sugar),
qplot(wineQualityData$chlorides),
qplot(wineQualityData$free.sulfur.dioxide),
qplot(wineQualityData$total.sulfur.dioxide),
qplot(wineQualityData$density),
qplot(wineQualityData$pH),
qplot(wineQualityData$sulphates),
qplot(wineQualityData$alcohol))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We analyze each property individually.
From the above plot, it appears that the majority of the values for fixed acidity lie in the range 5 to 14, so we limit our fixed acidity values to this range.
plot_univariate(wineQualityData$fixed.acidity, 5, 14, 1)
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).
## Warning: Removed 11 rows containing non-finite values (stat_bin).
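The warnings above appear because the axis limits drop every value outside the chosen range before the boxplot and histogram are computed; as far as we understand ggplot2's default behaviour, such values are replaced by missing values. The number of affected rows can be checked directly. The helper count_outside_limits and the toy vector below are hypothetical and serve only as an illustration.

```r
# Hypothetical helper: count values that fall outside [lower_limit, higher_limit].
count_outside_limits <- function(values, lower_limit, higher_limit) {
  sum(values < lower_limit | values > higher_limit, na.rm = TRUE)
}

# Toy values (made up); 4.6 and 15.9 fall outside the range 5 to 14.
toy_fixed_acidity <- c(4.6, 5.0, 7.9, 9.2, 14.0, 15.9)
n_removed <- count_outside_limits(toy_fixed_acidity, 5, 14)
print(n_removed)
```

On the real data, count_outside_limits(wineQualityData$fixed.acidity, 5, 14) should correspond to the number of rows reported as removed. Note that medians read from the trimmed boxplots describe only the rows that remain.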
The median for fixed acidity is around 8 and the distribution is positively skewed. A large number of values lie in the range 7 to 9.
The majority of the values for volatile acidity lie in the range 0.2 to 1.
plot_univariate(wineQualityData$volatile.acidity, 0.2, 1, 0.1)
## Warning: Removed 38 rows containing non-finite values (stat_boxplot).
## Warning: Removed 38 rows containing non-finite values (stat_bin).
The median is around 0.52 and this distribution is also positively skewed.
A lot of citric acid values appear to be zero. The data available for citric acid might be incomplete.
plot_univariate(wineQualityData$citric.acid, 0, 0.75, 0.1)
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
## Warning: Removed 6 rows containing non-finite values (stat_bin).
The distribution of residual sugar is heavily right-skewed (positively skewed): most of the data lies in the range 1 to 5, with a long tail of higher values.
plot_univariate(wineQualityData$residual.sugar, 1, 5, 0.5)
## Warning: Removed 86 rows containing non-finite values (stat_boxplot).
## Warning: Removed 86 rows containing non-finite values (stat_bin).
Even after filtering some outliers, the data is still positively skewed with a median around 2.25.
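The direction of the skew can also be checked numerically instead of by eye. The sketch below uses a simple moment-based skewness (the third central moment divided by the second central moment raised to the power 1.5); the helper name sample_skewness and the toy vectors are hypothetical. A positive result indicates a longer right tail, a value near zero a roughly symmetric distribution.

```r
# Hypothetical helper: moment-based sample skewness.
sample_skewness <- function(values) {
  centered <- values - mean(values)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy values (made up): a long right tail gives a positive result,
# a symmetric vector gives zero.
right_tailed <- c(1.9, 2.0, 2.1, 2.2, 2.6, 15.5)
symmetric <- c(1, 2, 3, 4, 5)

print(sample_skewness(right_tailed))
print(sample_skewness(symmetric))
```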
The data for chlorides is similar to that of residual sugar. We consider the data that lies between 0.04 and 0.14.
plot_univariate(wineQualityData$chlorides, 0.04, 0.14, 0.01)
## Warning: Removed 81 rows containing non-finite values (stat_boxplot).
## Warning: Removed 81 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
The data in this range appears to be approximately normally distributed, with a few outliers. The median is around 0.08.
Most of the values for free sulfur dioxide lie in the range of 0 to 35.
plot_univariate(wineQualityData$free.sulfur.dioxide, 0, 35, 2)
## Warning: Removed 77 rows containing non-finite values (stat_boxplot).
## Warning: Removed 77 rows containing non-finite values (stat_bin).
For this property we see a high peak around 7-8, while the median is around 13. The median lies above the peak because of the long tail of values in the high range, which gives the distribution its positive skew.
Most of the values for total sulfur dioxide are in the range 0 to 100. Since free sulfur dioxide is a component of total sulfur dioxide, we can expect to see a similarly positively skewed graph for total sulfur dioxide.
plot_univariate(wineQualityData$total.sulfur.dioxide, 0, 100, 5)
## Warning: Removed 127 rows containing non-finite values (stat_boxplot).
## Warning: Removed 127 rows containing non-finite values (stat_bin).
Our expectation was correct in this case: we see a positively skewed graph with a high peak around 25, whereas the median is around 36. The similar shapes suggest that the values for total sulfur dioxide may be somewhat proportional to those of free sulfur dioxide.
The data for density is normally distributed.
plot_univariate(wineQualityData$density, quantile(wineQualityData$density, 0.025), quantile(wineQualityData$density, 0.975), 0.001)
## Warning: Removed 79 rows containing non-finite values (stat_boxplot).
## Warning: Removed 79 rows containing non-finite values (stat_bin).
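For density and pH the limits are not picked by hand but taken from the 2.5% and 97.5% quantiles, which keeps the central 95% of the values. This is why roughly 80 of the 1599 rows (about 5%) are reported as removed. A minimal sketch of that idea follows; the helper central_range and the toy vector are hypothetical.

```r
# Hypothetical helper: limits that keep the central `coverage` share of the values.
central_range <- function(values, coverage = 0.95) {
  tail_share <- (1 - coverage) / 2
  unname(quantile(values, c(tail_share, 1 - tail_share)))
}

# Toy values 0, 1, ..., 100 (made up) make the result easy to check by hand.
limits <- central_range(0:100)
print(limits)
```

With such a helper, the two quantile() calls in the plot could be replaced by a single call to central_range(wineQualityData$density).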
Both the median and the mean are around 0.997, which is consistent with a symmetric, approximately normal distribution.
The data for pH level also appears to be approximately normally distributed.
plot_univariate(wineQualityData$pH, quantile(wineQualityData$pH, 0.025), quantile(wineQualityData$pH, 0.975), 0.05)
## Warning: Removed 80 rows containing non-finite values (stat_boxplot).
## Warning: Removed 80 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
Both the median and the mean are around 3.3, which is again consistent with a symmetric, approximately normal distribution.
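Agreement between mean and median indicates symmetry but does not by itself establish normality. As an additional, hedged check one could run a Shapiro-Wilk test (shapiro.test from base R's stats package) or inspect a normal Q-Q plot. The helper normality_summary below is hypothetical, and the values are simulated only to show the call; they are not the real pH column.

```r
# Hypothetical helper: a few quick indicators of (non-)normality.
normality_summary <- function(values) {
  list(
    mean = mean(values),
    median = median(values),
    # Shapiro-Wilk test; small p-values indicate a departure from normality.
    shapiro_p_value = shapiro.test(values)$p.value
  )
}

# Simulated values, only to show the call (not the real pH column).
set.seed(42)
simulated_ph <- rnorm(200, mean = 3.3, sd = 0.15)
ph_summary <- normality_summary(simulated_ph)
str(ph_summary)
```

With 1599 rows, even small departures from normality can produce a small p-value, so the test is best read together with the histogram rather than on its own.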
For sulphates, we put our limits at 0.3 and 1.
plot_univariate(wineQualityData$sulphates, 0.3, 1, 0.05)
## Warning: Removed 58 rows containing non-finite values (stat_boxplot).
## Warning: Removed 58 rows containing non-finite values (stat_bin).
Most of the wines have an alcohol content of around 9 to 11%, which is typical, with a few values going up to 13%.
plot_univariate(wineQualityData$alcohol, 9, 13, 0.5)
## Warning: Removed 30 rows containing non-finite values (stat_boxplot).
## Warning: Removed 30 rows containing non-finite values (stat_bin).
This graph is positively skewed with a median around 10.2, which is expected because most of the wines have their alcohol percentage in the 9% to 11% range.
plot_bivariate_wrt_quality(wineQualityData$density)
From the above plots we can see that wines with higher quality have low median density. We can see a negative correlation between quality and density of a wine.
plot_bivariate_wrt_quality(wineQualityData$alcohol)
Higher quality wines in the dataset have higher alcohol content on average as compared to the lower quality ones. There is a positive correlation between alcohol and quality.
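The statement about averages can be backed by a per-quality summary in addition to the boxplot. The following is a minimal sketch using tapply on a toy data frame with made-up values (not the real dataset).

```r
# Toy data with made-up values (not the real dataset).
toy_wines <- data.frame(
  quality = c(5, 5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.1, 11.8, 12.4)
)

# Median alcohol content within each quality level.
median_alcohol <- tapply(toy_wines$alcohol, toy_wines$quality, median)
print(median_alcohol)
```

On the real data the same call would be tapply(wineQualityData$alcohol, wineQualityData$quality, median).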
plot_bivariate_wrt_quality(wineQualityData$pH)
Wines are generally acidic in nature, which explains why all pH levels in the dataset are below 7 (which is neutral). We can observe that most wines have a pH level in the range 3 to 4, and there is a slight negative correlation with quality.
plot_bivariate_wrt_quality(wineQualityData$residual.sugar)
There are many outliers for the residual sugar property. Let’s filter out the outliers and plot the values.
x = seq(wineQualityData$residual.sugar)
grid.arrange(ggplot(wineQualityData, aes(x = x, y = wineQualityData$residual.sugar, color = factor(wineQualityData$quality), shape = factor(wineQualityData$quality))) +
geom_point(size = 2, alpha = 0.4) +
scale_color_identity(guide = 'legend') +
ylab('Property') +
ylim(0, 6) +
xlab('#'),
ggplot(wineQualityData, aes(x = factor(wineQualityData$quality), y = wineQualityData$residual.sugar)) +
geom_boxplot() +
ylim(0, 6),
nrow = 2)
## Warning: Removed 48 rows containing missing values (geom_point).
## Warning: Removed 48 rows containing non-finite values (stat_boxplot).
The residual sugar content is almost the same for all qualities of wine.
plot_bivariate_wrt_quality(wineQualityData$sulphates)
It looks like sulphates also has a lot of outliers; however, we can observe a positive correlation from the boxplot. Let’s have a closer look.
x = seq(wineQualityData$sulphates)
grid.arrange(ggplot(wineQualityData, aes(x = x, y = wineQualityData$sulphates, color = factor(wineQualityData$quality), shape = factor(wineQualityData$quality))) +
geom_point(size = 2, alpha = 0.4) +
scale_color_identity(guide = 'legend') +
ylab('Property') +
ylim(0, 1) +
xlab('#'),
ggplot(wineQualityData, aes(x = factor(wineQualityData$quality), y = wineQualityData$sulphates)) +
geom_boxplot() +
ylim(0, 1),
nrow = 2)
## Warning: Removed 58 rows containing missing values (geom_point).
## Warning: Removed 58 rows containing non-finite values (stat_boxplot).
Yes, our observation was correct: better quality wines have higher sulphate content.
correlations <- c(
cor.test(wineQualityData$fixed.acidity, as.numeric(wineQualityData$quality))$estimate,
cor.test(wineQualityData$volatile.acidity, as.numeric(wineQualityData$quality))$estimate,
cor.test(wineQualityData$citric.acid, as.numeric(wineQualityData$quality))$estimate,
cor.test(log10(wineQualityData$residual.sugar), as.numeric(wineQualityData$quality))$estimate,
cor.test(log10(wineQualityData$chlorides), as.numeric(wineQualityData$quality))$estimate,
cor.test(wineQualityData$free.sulfur.dioxide, as.numeric(wineQualityData$quality))$estimate,
cor.test(wineQualityData$total.sulfur.dioxide, as.numeric(wineQualityData$quality))$estimate,
cor.test(wineQualityData$density, as.numeric(wineQualityData$quality))$estimate,
cor.test(wineQualityData$pH, as.numeric(wineQualityData$quality))$estimate,
cor.test(log10(wineQualityData$sulphates), as.numeric(wineQualityData$quality))$estimate,
cor.test(wineQualityData$alcohol, as.numeric(wineQualityData$quality))$estimate)
names(correlations) <- c('fixed.acidity', 'volatile.acidity', 'citric.acid',
'log10.residual.sugar',
'log10.chlorides', 'free.sulfur.dioxide',
'total.sulfur.dioxide', 'density', 'pH',
'log10.sulphates', 'alcohol')
correlations
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## log10.residual.sugar log10.chlorides free.sulfur.dioxide
## 0.02353331 -0.17613996 -0.05065606
## total.sulfur.dioxide density pH
## -0.18510029 -0.17491923 -0.05773139
## log10.sulphates alcohol
## 0.30864193 0.47616632
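The eleven near-identical cor.test calls could be written more compactly. The sketch below is one possible way, not the code used for the numbers above: the helper correlations_with_quality is hypothetical, it does not apply the log10 transformations (those columns would have to be transformed beforehand), and the toy data frame contains made-up values. Since quality is an ordinal rating, method = 'spearman' is a reasonable alternative to the default Pearson correlation.

```r
# Hypothetical helper: correlation of each listed attribute with quality.
correlations_with_quality <- function(data, attributes, method = 'pearson') {
  sapply(data[attributes], cor, y = as.numeric(data$quality), method = method)
}

# Toy data with made-up, perfectly monotone values (not the real dataset).
toy_wines <- data.frame(
  alcohol = c(9, 10, 11, 12),
  volatile.acidity = c(0.8, 0.6, 0.4, 0.2),
  quality = c(4, 5, 6, 7)
)

toy_correlations <- correlations_with_quality(toy_wines, c('alcohol', 'volatile.acidity'))
print(toy_correlations)

# Rank-based alternative for the ordinal quality rating.
toy_rank_correlations <- correlations_with_quality(
  toy_wines, c('alcohol', 'volatile.acidity'), method = 'spearman')
```

The result is a named numeric vector, so no separate names() assignment is needed.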
From the above values we can say that alcohol, volatile acidity and sulphates have the highest correlations with quality. We already observed that alcohol and sulphates have a positive correlation with quality. Let’s have a look at volatile acidity vs quality.
plot_bivariate_wrt_quality(wineQualityData$volatile.acidity)
Volatile acidity has a moderate negative correlation with wine quality (about -0.39).
In the previous section we observed which properties are directly related to the quality of wines. Let’s have a look at how combinations of these factors relate to quality.
ggplot(wineQualityData, aes(x = wineQualityData$volatile.acidity, y = wineQualityData$alcohol, color = factor(wineQualityData$quality))) +
geom_point(alpha = 0.5) +
scale_color_identity(guide = 'legend')
The above graph shows that wines with higher alcohol content and lower volatile acidity tend to have higher quality rating.
ggplot(wineQualityData, aes(x = wineQualityData$sulphates, y = wineQualityData$alcohol, color = factor(wineQualityData$quality))) +
geom_point(alpha = 0.5) +
scale_color_identity(guide = 'legend')
Higher quality wines tend to have higher sulphate levels, in line with the positive correlation found earlier. Based on the past two observations, we can expect good quality wines in a graph of sulphates and volatile acidity to be concentrated at low volatile acidity, towards the bottom of the graph. Let’s have a look.
ggplot(wineQualityData, aes(x = wineQualityData$sulphates, y = wineQualityData$volatile.acidity, color = factor(wineQualityData$quality))) +
geom_point(alpha = 0.4) +
scale_color_identity(guide = 'legend')
This graph stays true to our expectation. A lot of good quality wines lie in the bottom left of the graph: they have low volatile acidity, and since a few sulphate outliers stretch the x-axis up to 2, even above-average sulphate values appear in the left half.
The following graph shows a clear negative correlation between wine quality and volatile acidity: the better the wine quality, the lower the volatile acidity tends to be.
ggplot(wineQualityData, aes(x = factor(wineQualityData$quality), y = wineQualityData$volatile.acidity)) +
geom_boxplot(color = 'black', fill = 'cadetblue3', alpha = 0.4) +
ylab('Volatile Acidity') +
xlab('Quality')
We observed that alcohol content has the strongest positive correlation with quality (about 0.48). The following graph depicts that.
ggplot(wineQualityData, aes(x = factor(wineQualityData$quality), y = wineQualityData$alcohol)) +
geom_boxplot(color = 'black', fill = 'cadetblue3', alpha = 0.4) +
ylab('Alcohol') +
xlab('Quality')
ggplot(wineQualityData, aes(x = wineQualityData$volatile.acidity, y = wineQualityData$alcohol, color = factor(wineQualityData$quality))) +
geom_point(alpha = 0.5) +
scale_color_identity(guide = 'legend') +
xlab('Volatile Acidity') +
ylab('Alcohol') +
labs(color = 'Quality')
The above plots suggest that volatile acidity and alcohol are the properties most strongly associated with the quality of a wine. Other factors such as density, pH level and sulphates are also related to wine quality to some extent.
We were able to identify some properties that might be affecting the quality of a wine. However, our dataset only had 1599 different wines, all produced in a certain region of Portugal, which is a small sample compared to the large number of wines available in the market. Therefore our analysis need not necessarily apply to wines made in other countries. We also need to keep in mind that the quality ratings were given by a fixed group of individuals and, since taste differs from person to person, these ratings need not necessarily reflect the preferences of the entire populace.